Abstract. We present a new data structure called the Compressed Random
Access Memory (CRAM) that can store a dynamic string T of characters,
e.g., representing the memory of a computer, in compressed form
while achieving asymptotically almost-optimal bounds (in terms of em- 
pirical entropy) on the compression ratio. It allows short substrings of T 
to be decompressed and retrieved efficiently and, significantly, characters 
at arbitrary positions of T to be modified quickly during execution with- 
out decompressing the entire string. This can be regarded as a new type 
of data compression that can update a compressed file directly. More- 
over, at the cost of slightly increasing the time spent per operation, the 
CRAM can be extended to also support insertions and deletions. Our 
key observation that the empirical entropy of a string does not change 
much after a small change to the string, as well as our simple yet efficient 
method for maintaining an array of variable-length blocks under length
modifications, may be useful for many other applications as well. 



1 Introduction 

Certain modern-day information technology-based applications require random 
access to very large data structures. For example, to do genome assembly in 
bioinformatics, one needs to maintain a huge graph [18]. Other examples include
dynamic programming-based problems, such as optimal sequence alignment or
finding maximum bipartite matchings, which need to create large tables (often
containing a lot of redundancy). Yet another example is in image processing, 
where one sometimes needs to edit a high-resolution image which is too big to 
load into the main memory of a computer all at once. Additionally, a current 
trend in the mass consumer electronics market is cheap mobile devices with 
limited processing power and relatively small memories; although these are not 
designed to process massive amounts of data, it could be economical to store 
non-permanent data and software on them more compactly, if possible. 

The standard solution to the above problem is to employ secondary memory 
(disk storage, etc.) as an extension of the main memory of a computer. This 
technique is called virtual memory. The drawback of virtual memory is that the 
processing time will be slowed down since accessing the secondary memory is
an order of magnitude slower than accessing the main memory. An alternative
approach is to compress the data T and store it in the main memory. By using 
existing data compression methods, T can be stored in $nH_k + o(n \log \sigma)$ bits of
space [2, 8] for every $0 \le k < \log_\sigma n$, where n is the length of T, $\sigma$ is the size of
the alphabet, and $H_k(T)$ denotes the k-th order empirical entropy of T. Although
this greatly reduces the amount of storage needed, it does not work well because it
becomes computationally expensive to access and update T.

Motivated by applications that would benefit from having a large virtual 
memory that supports fast access and update operations, we consider the following
task: Given a memory/text T[1..n] over an alphabet of size $\sigma$, maintain
a data structure that stores T compactly while supporting the following operations.
(We assume that $\ell = \Theta(\log_\sigma n)$ is the length of one machine word.)

• access(T, i): Return the substring $T[i..(i + \ell - 1)]$.

• replace(T, i, c): Replace T[i] by a character $c \in [\sigma]$.

• delete(T, i): Delete T[i], i.e., make T one character shorter.

• insert(T, i, c): Insert a character c into T between positions $i - 1$ and i, i.e.,
make T one character longer.
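As a point of reference, the semantics of the four operations can be modelled on an uncompressed Python list. This is only a baseline sketch of the intended behaviour with 1-based positions; the class name `PlainMemory` and the parameter `ell` (standing for $\ell$) are illustrative assumptions, and nothing here is compressed.

```python
class PlainMemory:
    """Uncompressed reference model of access/replace/delete/insert (1-based)."""

    def __init__(self, text, ell):
        self.t = list(text)   # the string T, stored verbatim
        self.ell = ell        # number of characters returned by access

    def access(self, i):
        # Return T[i..i+ell-1]; the result is shorter near the end of T.
        return "".join(self.t[i - 1:i - 1 + self.ell])

    def replace(self, i, c):
        self.t[i - 1] = c

    def delete(self, i):
        del self.t[i - 1]

    def insert(self, i, c):
        # c is placed between old positions i-1 and i, i.e. at new position i.
        self.t.insert(i - 1, c)
```

Any compressed implementation of the CRAM or the extended CRAM should return the same answers as this model on every sequence of operations.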

Compressed Read Only Memory: When only the access operation is supported,
we call the data structure Compressed Read Only Memory. Sadakane
and Grossi [17], Gonzalez and Navarro [6], and Ferragina and Venturini [4]
developed storage schemes for storing a text succinctly that allow constant-time
access to any word in the text. More precisely, these schemes store T[1..n] in
$nH_k(T) + O\left(n \log \sigma \cdot \frac{k \log \sigma + \log\log n}{\log n}\right)$ bits and access(T, i) takes O(1) time, and
both the space and access time are optimal for this task. Note, however, that
none of these schemes allow T to be modified.

Compressed Random Access Memory (CRAM): When the operations 
access and replace are supported, we call the data structure Compressed Ran- 
dom Access Memory (CRAM). As far as we know, it has not been considered 
previously in the literature, even though it appears to be a fundamental and 
important data structure. 

Extended CRAM: When all four operations are supported, we call the data 
structure extended CRAM. It is equivalent to the dynamic array [16] and also
solves the list representation problem [5]. Fredman and Saks [5] proved a cell
probe lower bound of $\Omega(\log n / \log\log n)$ time for the latter, and also showed
that $n^{\Omega(1)}$ update time is needed to support constant-time access. Raman et
al. [16] presented an $n \log \sigma + o(n \log \sigma)$-bit data structure which supports access,
replace, delete, and insert in $O(\log n / \log\log n)$ time. Navarro and Sadakane
[15] recently gave a data structure using $nH_0(T) + O(n \log \sigma / \log^{\varepsilon} n + \sigma \log^{\varepsilon} n)$
bits that supports access, delete, and insert in $O\left(\frac{\log n}{\log\log n}\left(1 + \frac{\log \sigma}{\log\log n}\right)\right)$ time.



* The notation $[\sigma]$ stands for the set $\{1, 2, \ldots, \sigma\}$.

^ Reference [17] has a slightly worse space complexity.



1.1 Our contributions 



This paper studies the complexity of maintaining the CRAM and extended 
CRAM data structures. We assume the uniform-cost word RAM model with word 
size $w = \Theta(\log n)$ bits, i.e., standard arithmetic and bitwise boolean operations
on w-bit word-sized operands can be performed in constant time [S]. Also, we 
assume that the memory consists of a sequence of bits, and each bit is identified 
with an address in $0, \ldots, 2^w - 1$. Furthermore, any consecutive w bits can be
accessed in constant time. (Note that this memory model is equivalent under the 
word RAM model to a standard memory model consisting of a sequence of words 
of some fixed length.) At any time, if the highest address of the memory used by 
the algorithm is s, the space used by the algorithm is said to be $s + 1$ bits [10].

Our main results for the CRAM are summarized in: 

Theorem 1. Given a text T[1..n] over an alphabet of size $\sigma$ and any fixed $\varepsilon > 0$,
after $O(n \log \sigma / \log n)$ time preprocessing, the CRAM data structure for T[1..n]
can be stored in $nH_k(T) + O\left(n \log \sigma \left((k+1)\varepsilon + \frac{k \log \sigma + \log\log n}{\log n}\right)\right)$ bits for every
$0 \le k < \log_\sigma n$ simultaneously, where $H_k(T)$ denotes the k-th order empirical
entropy of T, while supporting access(T, i) in O(1) time and replace(T, i, c)
for any character c in $O(1/\varepsilon)$ time.

Theorem [1] is proved in Section [5] below. 

Next, by setting $\varepsilon = \max\left\{\frac{\log \sigma}{\log n}, \frac{\log\log n}{(k+1)\log n}\right\}$, we obtain:

Corollary 1. Given a text T[1..n] over an alphabet of size $\sigma$ and any fixed
$k = o(\log_\sigma n)$, after $O(n \log \sigma / \log n)$ time preprocessing, the CRAM data structure
for T[1..n] can be stored in $nH_k(T) + O\left(n \log \sigma \cdot \frac{(k+1) \log \sigma + \log\log n}{\log n}\right)$ bits while
supporting access(T, i) in O(1) time and replace(T, i, c) for any character c
in $O(\min\{\log_\sigma n, (k+1) \log n / \log\log n\})$ time.
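The step from Theorem 1 to the corollary can be checked by a short calculation. The derivation below is a sketch that uses only the bounds of Theorem 1 and takes $\varepsilon$ to be the larger of $\log\sigma/\log n$ and $\log\log n/((k+1)\log n)$, which is the choice whose reciprocal gives the stated replace time.

```latex
\begin{align*}
\varepsilon &= \max\left\{\frac{\log\sigma}{\log n},\ \frac{\log\log n}{(k+1)\log n}\right\}
\quad\Longrightarrow\quad
\frac{1}{\varepsilon} = \min\left\{\log_\sigma n,\ \frac{(k+1)\log n}{\log\log n}\right\},\\[4pt]
(k+1)\,\varepsilon &= \max\left\{\frac{(k+1)\log\sigma}{\log n},\ \frac{\log\log n}{\log n}\right\}
\;\le\; \frac{(k+1)\log\sigma + \log\log n}{\log n}.
\end{align*}
```

The first line gives the $O(1/\varepsilon)$ replace time in the form stated in the corollary, and the second line bounds the extra $(k+1)\varepsilon$ term in the space bound of Theorem 1.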

For the extended CRAM, we have: 

Theorem 2. Given a text T[1..n] over an alphabet of size $\sigma$, after $O(n \log \sigma / \log n)$
time preprocessing, the extended CRAM data structure for T[1..n] can be stored
in $nH_k(T) + O\left(n \log \sigma \cdot \frac{k \log \sigma + (k+1) \log\log n}{\log n}\right)$ bits for every $0 \le k < \log_\sigma n$
simultaneously, where $H_k(T)$ denotes the k-th order empirical entropy of T, while
supporting all four operations in $O(\log n / \log\log n)$ time.

Due to lack of space, the proof of Theorem [2] is given in Appendix [B].

Table [1] shows a comparison with existing data structures. Many existing
dynamic data structures for storing compressed strings [7, 11, 13, 15] use the fact
$nH_0(S) \approx \log \binom{n}{n_1, \ldots, n_\sigma}$, where $n_c$ is the number of occurrences of character c in
the string S. However, this approach is helpful for small alphabets only because
of the size of the auxiliary data. For large alphabets, generalized wavelet trees [3]
can be used to decompose a large alphabet into smaller ones, but this slows down
the access and update times. For example, if $\sigma = \sqrt{n}$, the time complexity of
those data structures is $O((\log n / \log\log n)^2)$, while ours is $O(\log n / \log\log n)$,
or even constant. Also, a technical issue when using large alphabets is how to
update the code tables for encoding characters to achieve the entropy bound.
Code tables that achieve the entropy bound will change when the string changes,
and updating the entire data structure with the new code table is too time-consuming.

Table 1. Comparison of existing data structures and the new ones from this paper.
For simplicity, we assume $\sigma = o(n)$. The upper table lists results for the Compressed
Read Only Memory (the first line) and the CRAM (the second and third lines), and
the lower table lists results for the extended CRAM.

Our results depend on a new analysis of the empirical entropies of similar
strings in Section [3]. We prove that the empirical entropy of a string does not
change a lot after a small change to the string (Theorem [4]). By using this fact,
we can delay updating the entire code table. Thus, after each update operation
to the string, we just change a part of the data structure according to the new
code table. In Section [5], we show that the redundancy in space usage by this
method is negligible, and we obtain Theorem [1].

Looking at Table [1], we observe that Theorem [1] can be interpreted as saying
that for arbitrarily small, fixed $\varepsilon > 0$, by spending $O(n \log \sigma \cdot \varepsilon(k+1))$ bits of space
more than the best existing data structures for Compressed Read Only Memory,
we can also get $O(1/\varepsilon)$ (i.e., constant) time replace operations.

1.2 Organization of the paper 

Section [2] reviews the definition of the empirical entropy of a string and the
data structure of Ferragina and Venturini [4]. In Section [3], we prove an important
result on the empirical entropies of similar strings. In Section [4] and
Appendix [A], we describe a technique for maintaining an array of variable-length
blocks. Section [5] and Appendix [B] explain how to implement the CRAM and the
extended CRAM data structures to achieve the bounds stated in Theorems [1]
and [2] above. Finally, Section [6] gives some concluding remarks. Experimental
results that demonstrate the good performance of the CRAM in practice can be
found in Appendix [C].



2 Preliminaries 



2.1 Empirical entropy 

The compression ratio of a data compression method is often expressed in terms
of the empirical entropy of the input strings [12]. We first recall the definition of
this concept. Let T be a string of length n over an alphabet $\mathcal{A} = [\sigma]$. Let $n_c$ be
the number of occurrences of $c \in \mathcal{A}$ in T. Let $\{P_c = n_c/n\}_{c=1}^{\sigma}$ be the empirical
probability distribution for the string T. The 0-th order empirical entropy of T
is defined as $H_0(T) = -\sum_{c=1}^{\sigma} P_c \log P_c$. We also use $H_0(p)$ to denote the 0-th
order empirical entropy of a string whose empirical probability distribution is p.

Next, let k be any non-negative integer. If a string $s \in \mathcal{A}^k$ precedes a symbol
c in T, s is called the context of c. We denote by $T^{(s)}$ the string that is the
concatenation of all symbols, each of whose context in T is s. The k-th order
empirical entropy of T is defined as $H_k(T) = \frac{1}{n} \sum_{s \in \mathcal{A}^k} |T^{(s)}| H_0(T^{(s)})$. It was
shown [12] that for any $k \ge 0$ we have $H_k(T) \ge H_{k+1}(T)$ and $nH_k(T)$ is a lower
bound to the output size of any compressor that encodes each symbol of T with
a code that only depends on the symbol and its context of length k.
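The two definitions translate directly into a few lines of code. The following is a minimal sketch (the function names `h0` and `hk` are illustrative); it measures entropy in bits per symbol and, as a simplifying assumption, skips the first k symbols of the string, which have no full context.

```python
from collections import Counter, defaultdict
from math import log2

def h0(s):
    """0-th order empirical entropy H_0(s) in bits per symbol (0 for empty s)."""
    n = len(s)
    return -sum(c / n * log2(c / n) for c in Counter(s).values()) if n else 0.0

def hk(t, k):
    """k-th order empirical entropy H_k(t): symbols are grouped by the
    k symbols preceding them; the first k symbols have no full context
    and are skipped in this sketch."""
    groups = defaultdict(list)
    for i in range(k, len(t)):
        groups[t[i - k:i]].append(t[i])      # t^(s) for the context s = t[i-k:i]
    return sum(len(g) * h0(g) for g in groups.values()) / len(t) if t else 0.0
```

For a string such as `"abababab"`, `h0` gives one bit per symbol while `hk` with k = 1 gives zero, since every symbol is determined by its predecessor.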

The technique of blocking, i.e., to conceptually merge consecutive symbols to
form new symbols over a larger alphabet, is used to reduce the redundancy of
Huffman encoding for compressing a string. The string T of length n is partitioned
into $n/\ell$ blocks of length $\ell$ each, then Huffman or other entropy codings
are applied to compress a new string $T_\ell$ of those blocks. We call this operation
blocking of length $\ell$.

To prove our new results, we shall use the following theorem in Section [3].

Theorem 3 ([1, Theorem 16.3.2]). Let p and q be two probability mass functions
on $\mathcal{A}$ such that $\|p - q\|_1 = \sum_{c \in \mathcal{A}} |p(c) - q(c)| \le \frac{1}{2}$. Then
$|H_0(p) - H_0(q)| \le -\|p - q\|_1 \log \frac{\|p - q\|_1}{|\mathcal{A}|}$.
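The inequality can be sanity-checked numerically on small distributions. The sketch below is an illustration only; the helper names are hypothetical, logarithms are taken base 2, and the bound function assumes $0 < \|p - q\|_1 \le 1/2$.

```python
from math import log2

def entropy(p):
    """Entropy of a probability vector p, in bits."""
    return -sum(x * log2(x) for x in p if x > 0)

def l1(p, q):
    """L1 distance between two probability vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(p, q))

def theorem3_bound(p, q):
    """Right-hand side -||p-q||_1 * log(||p-q||_1 / |A|);
    requires 0 < ||p-q||_1 <= 1/2, with |A| = len(p)."""
    d = l1(p, q)
    return -d * log2(d / len(p))
```

For p = (0.5, 0.25, 0.25) and q = (0.4, 0.35, 0.25) the distance is 0.2, and the entropy difference is well below the value returned by `theorem3_bound`.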

2.2 Review of Ferragina and Venturini's data structure 

Here, we briefly review the data structure of Ferragina and Venturini from [4].
It uses the same basic idea as Huffman coding: replace every fixed-length block 
of symbols by a variable-length code in such a way that frequently occurring 
blocks get shorter codes than rarely occurring blocks. 

To be more precise, consider a text T[1..n] over an alphabet $\mathcal{A}$ where $|\mathcal{A}| = \sigma$
and $\sigma < n$. Let $\ell = \frac{1}{2} \log_\sigma n$ and $\tau = \log n$. Partition T[1..n] into super-blocks,
each containing $\tau\ell$ characters. Each super-block is further partitioned into $\tau$ blocks,
each containing $\ell$ characters. Denote the $n/\ell$ blocks by $T_i = T[(i-1)\ell + 1..i\ell]$ for
$i = 1, 2, \ldots, n/\ell$.

Since each block is of length $\ell$, there are at most $\sigma^\ell = \sqrt{n}$ distinct blocks.
For each block $P \in \mathcal{A}^\ell$, let $f(P)$ be the frequency of P in $\{T_1, \ldots, T_{n/\ell}\}$. Let
$r(P)$ be the rank of P according to the decreasing frequency, i.e., the number of
distinct blocks $P'$ such that $f(P') \ge f(P)$, and $r^{-1}(j)$ be its inverse function.
Let $enc(j)$ be the rank j-th binary string in $[\epsilon, 0, 1, 00, 01, 10, 11, 000, \ldots]$.

The data structure of Ferragina and Venturini consists of four arrays: 



• $V = enc(r(T_1)) \ldots enc(r(T_{n/\ell}))$.

• $r^{-1}(j)$ for $j = 1, \ldots, \sqrt{n}$.

• Table $T_{Sblk}[1..\frac{n}{\ell\tau}]$ stores the starting position in V of the encoding of every
super-block.

• Table $T_{blk}[1..\frac{n}{\ell}]$ stores the starting position in V of the encoding of every
block relative to the beginning of its enclosing super-block.

The algorithm for access(T, i) is simple: Given i, compute the address where
the block for T[i] is encoded by using $T_{Sblk}$ and $T_{blk}$ and obtain the code which
encodes the rank of the block. Then, from $r^{-1}$, obtain the substring. In total,
this takes O(1) time. This yields:

Lemma 1 ([4]). Any substring T[i..j] can be retrieved in $O(1 + (j - i + 1)/\log_\sigma n)$
time.
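The block-ranking idea behind this scheme can be sketched in a few lines. The code below is a simplified illustration, not the structure of [4] itself: it keeps one code string per block in a Python list instead of concatenating them into V with the two pointer tables, breaks frequency ties arbitrarily, and assumes that `ell` divides the text length. The function names are hypothetical.

```python
from collections import Counter

def enc(j):
    """j-th string (j >= 1) of the sequence '', '0', '1', '00', '01', ..."""
    return bin(j)[3:]          # binary representation of j without its leading '1'

def fv_encode(text, ell):
    """Split text into blocks of ell symbols, rank the distinct blocks by
    decreasing frequency and return (codes, by_freq)."""
    blocks = [text[i:i + ell] for i in range(0, len(text), ell)]
    by_freq = [b for b, _ in Counter(blocks).most_common()]
    rank = {b: j for j, b in enumerate(by_freq, start=1)}
    codes = [enc(rank[b]) for b in blocks]   # V would be the concatenation of these
    return codes, by_freq                    # by_freq[j - 1] plays the role of r^{-1}(j)

def fv_access_block(codes, by_freq, x):
    """Decode the x-th block (1-based) from its code."""
    return by_freq[int("1" + codes[x - 1], 2) - 1]
```

Note that the codes are not prefix-free (the most frequent block gets the empty string), which is why the real structure needs the tables of starting positions to delimit each code inside V.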

Using the data structure of Ferragina and Venturini, T[1..n] can be encoded
using $nH_k + O\left(\frac{n}{\log_\sigma n}(k \log \sigma + \log\log n)\right)$ bits according to the next lemma.

Lemma 2 ([4]). The space needed by V, $r^{-1}$, $T_{Sblk}$, and $T_{blk}$ is as follows:

• V is of length $nH_k + 2 + O(k \log n) + O(nk \log \sigma / \ell)$ bits, simultaneously for
all $0 \le k < \log_\sigma n$.

• $r^{-1}(j)$ for $j = 1, \ldots, \sqrt{n}$ can be stored in $\sqrt{n} \log n$ bits.

• $T_{Sblk}[1..\frac{n}{\ell\tau}]$ can be stored in $O(\frac{n}{\ell})$ bits.

• $T_{blk}[1..\frac{n}{\ell}]$ can be stored in $O(\frac{n}{\ell} \log\log n)$ bits.

3 Entropies of similar strings 

In this section, we prove that the empirical entropy of a string does not change
much after a small change to it. This result will be used to bound the space
complexity of our main data structure in Section [5.4]. Consider two strings T
and T' of length n and n', respectively, such that the edit distance between T
and T' is one. That is, T' can be obtained from T by replacement, insertion,
or deletion of one character. We show that the empirical entropies of the two
strings do not differ much.

Theorem 4. For two strings T and T' on alphabet $\mathcal{A}$ of length n and n' respectively,
such that the edit distance between T and T' is one, and for any integer
$k \ge 0$, $|nH_k(T) - n'H_k(T')| = O((k+1)(\log n + \log|\mathcal{A}|))$.

To prove Theorem [4], we first prove the following:

Lemma 3. Let T be a string of length n over an alphabet $\mathcal{A}$, $T^-$ be a string
made by deleting a character from T at any position, $T^+$ be a string made by
inserting a character into T at any position, and T' be a string made by replacing a
character of T with another one at any position. Then the following relations
hold:

$|nH_0(T) - (n-1)H_0(T^-)| \le 4 \log n + 3 \log|\mathcal{A}|$   (if $n \ge 1$)   (1)
$|nH_0(T) - (n+1)H_0(T^+)| \le 4 \log(n+1) + 4 \log|\mathcal{A}|$   (if $n \ge 0$)   (2)
$|nH_0(T) - nH_0(T')| \le 4 \log(n+1) + 3 \log|\mathcal{A}|$   (if $n \ge 0$)   (3)



Proof. Let $P(x)$, $P^-(x)$, $P^+(x)$, and $P'(x)$ denote the empirical probability of
a character $x \in \mathcal{A}$ in T, $T^-$, $T^+$, and T', respectively, and let $n_x$ denote the
number of occurrences of $x \in \mathcal{A}$ in T. It holds that $P(x) = n_x/n$ for any $x \in \mathcal{A}$.

If a character c is removed from T, it holds that $P^-(c) = \frac{n_c - 1}{n - 1}$, and $P^-(x) = \frac{n_x}{n - 1}$
for any other $x \in \mathcal{A}$. Then $\|P - P^-\|_1 = \sum_{x \in \mathcal{A}} |P(x) - P^-(x)| = \frac{2(n - n_c)}{n(n-1)}$.
If $n = 1$, it holds $H_0(T) = 0$, and therefore $nH_0(T) - (n-1)H_0(T^-) = 0$
and the claim holds. If $n = n_c$, which means that all characters in T are c, it
holds $H_0(T) = H_0(T^-) = 0$ and the claim holds. Otherwise, $0 < \|P - P^-\|_1 \le \frac{2}{n}$
holds. If $\|P - P^-\|_1 \le \frac{1}{2}$, from Theorem [3], $|H_0(P) - H_0(P^-)| \le
-\|P - P^-\|_1 \log \frac{\|P - P^-\|_1}{|\mathcal{A}|} \le \frac{2}{n} \log \frac{|\mathcal{A}| n(n-1)}{2}$. Then $|nH_0(T) - (n-1)H_0(T^-)| \le
n|H_0(P) - H_0(P^-)| + H_0(P^-) \le 4 \log n + 3 \log|\mathcal{A}|$. If $\|P - P^-\|_1 > \frac{1}{2}$, which
implies $n < 4$, $|nH_0(T) - (n-1)H_0(T^-)| \le 3 \log|\mathcal{A}|$. This proves the claim for
$T^-$.

If a character c is inserted into T, it holds that $P^+(c) = \frac{n_c + 1}{n + 1}$, and $P^+(x) = \frac{n_x}{n + 1}$
for any other $x \in \mathcal{A}$. Then $\|P - P^+\|_1 = \frac{2(n - n_c)}{n(n+1)}$. If $n = 0$, $H_0(T) =
H_0(T^+) = 0$ and the claim holds. If $n = n_c$, which means that $T^+$ consists
of only the character c, $H_0(T) = H_0(T^+) = 0$ and the claim holds. Otherwise,
$0 < \|P - P^+\|_1 \le \frac{2}{n}$ holds. If $\|P - P^+\|_1 \le \frac{1}{2}$, $|nH_0(T) - (n+1)H_0(T^+)| \le
n|H_0(P) - H_0(P^+)| + H_0(P^+) \le 4 \log n + 3 \log|\mathcal{A}|$. If $\|P - P^+\|_1 > \frac{1}{2}$, which
implies $n < 4$, $|nH_0(T) - (n+1)H_0(T^+)| \le 4 \log|\mathcal{A}|$. This proves the claim for
$T^+$.

If a character c of T is replaced with another character $c' \in \mathcal{A}$ ($c' \ne c$),
$\|P - P'\|_1 = \sum_{\alpha \in \mathcal{A}} |P(\alpha) - P'(\alpha)| = \left|\frac{n_c}{n} - \frac{n_c - 1}{n}\right| + \left|\frac{n_{c'}}{n} - \frac{n_{c'} + 1}{n}\right| = \frac{2}{n}$. If
$\|P - P'\|_1 \le \frac{1}{2}$, $|nH_0(T) - nH_0(T')| \le n|H_0(P) - H_0(P')| \le 4 \log n + 2 \log|\mathcal{A}|$.
If $\|P - P'\|_1 > \frac{1}{2}$, which implies $n < 4$, $|nH_0(T) - nH_0(T')| \le 3 \log|\mathcal{A}|$. If $c' = c$,
$T' = T$ and $|nH_0(T) - nH_0(T')| = 0$. This completes the proof. □

By using this lemma, we prove the theorem. 

Proof (of Theorem [4]). From the definition of the empirical entropy, $nH_k(T) =
\sum_{s \in \mathcal{A}^k} |T^{(s)}| H_0(T^{(s)})$. Therefore for each context $s \in \mathcal{A}^k$, we estimate the change
of 0-th order entropy.

Because the edit distance between T and T' is one, these are expressed
as $T = T_1 c T_2$ and $T' = T_1 c' T_2$ by two possibly empty strings $T_1$ and $T_2$,
and possibly empty characters c and c'. For the context $T_1[n_1 - k + 1..n_1]$
($n_1 = |T_1|$), denoted by $s_0$, the character c in the string $T^{(s_0)}$ will change
to c'. The character $T_2[i]$ ($i = 1, 2, \ldots, k$) has the context $T_1[n_1 - k + 1 +
i..n_1] c T_2[1..i-1]$, denoted by $s_i$, in T, but the context will change to $s'_i =
T_1[n_1 - k + 1 + i..n_1] c' T_2[1..i-1]$ in T'. Therefore a character $T_2[i]$ is removed
from the string $T^{(s_i)}$, and it is inserted into $T'^{(s'_i)}$. Therefore in at most $2k + 1$
strings ($T^{(s_0)}, T^{(s_1)}, \ldots, T^{(s_k)}, T'^{(s'_1)}, \ldots, T'^{(s'_k)}$), the entropies will change. From
Lemma [3], each one will change only by $O(\log n + \log|\mathcal{A}|)$. This proves the claim. □
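The statement that a single edit changes $nH_k$ by only $O((k+1)(\log n + \log|\mathcal{A}|))$ bits can be observed experimentally. The following sketch (function names are illustrative, logarithms are base 2, and the first k symbols without a full context are skipped as a simplification) computes $nH_k$ for two strings and returns the absolute difference.

```python
from collections import Counter, defaultdict
from math import log2

def n_hk(t, k):
    """n * H_k(t): sum over contexts s of |t^(s)| * H_0(t^(s)), in bits."""
    groups = defaultdict(Counter)
    for i in range(k, len(t)):
        groups[t[i - k:i]][t[i]] += 1        # count symbols following context s
    total = 0.0
    for cnt in groups.values():
        m = sum(cnt.values())                # |t^(s)|
        total -= sum(c * log2(c / m) for c in cnt.values())
    return total

def entropy_change(t, t2, k):
    """|n H_k(T) - n' H_k(T')| for two strings, e.g. at edit distance one."""
    return abs(n_hk(t, k) - n_hk(t2, k))
```

For example, replacing one character in a 220-character repetition of "abracadabra" by a fresh symbol changes $nH_k$ by far less than $(2k+1)(4\log(n+1) + 4\log|\mathcal{A}|)$ for small k, in line with the counting argument of the proof.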



4 Memory management 



This section presents a data structure for storing a set B of m variable-length
strings over the alphabet {0, 1}, which is an extension of the one in [15]. The
data structure allows the contents of the strings and their lengths to change,
but the value of m must remain constant. We assume a unit-cost word RAM
model with word size w bits. The memory consists of consecutively ordered bits,
and any consecutive w bits can be accessed in constant time, as stated above. A
string over {0, 1} of length at most b is called a ($\le b$)-block. Our data structure
stores a set B of m such ($\le b$)-blocks, while supporting the following operations:

• address(i): Return a pointer to where in the memory the i-th ($\le b$)-block
is stored ($1 \le i \le m$).

• realloc(i, b'): Change the length of the i-th ($\le b$)-block to b' bits ($1 \le i \le m$).
The physical address for storing the block (address(i)) may change.

Theorem 5. Given that $b \le m$ and $\log m \le w$, consider the unit-cost word
RAM model with word size w. Let $B = \{B[1], B[2], \ldots, B[m]\}$ be a set of ($\le b$)-blocks
and let s be the total number of bits of all ($\le b$)-blocks in B. We can
store B in $s + O(m \log m + b^2)$ bits while supporting address in O(1) time and
realloc in $O(b/w)$ time.

Theorem 6. Given a parameter $b = O(w)$, consider the unit-cost word RAM
model with word size w. Let $B = \{B[1], B[2], \ldots, B[m]\}$ be a set of ($\le b$)-blocks,
and let s be the total number of bits of all ($\le b$)-blocks in B. We can store B in
$s + O(w^4 + m \log w)$ bits while supporting address and realloc in worst-case
O(1) time.

(Due to lack of space, the proofs of Theorems [5] and [6] are given in Appendix [A].)
From here on, we say that the data structure has parameters (b, m).

5 A data structure for maintaining the CRAM 

This section is devoted to proving Theorem [1]. Our aim is to dynamize Ferragina
and Venturini's data structure [4] by allowing replace operations. (Our data
structure for the extended CRAM which also supports insert and delete is
described in Appendix [B].) Ferragina and Venturini's data structure uses a code
table for encoding the string, while our data structure uses two code tables,
which will change during update operations.

Given a string T[1..n] defined over an alphabet $\mathcal{A}$ ($|\mathcal{A}| = \sigma$), we support
two operations: (1) access(T, i), which returns $T[i..i + \frac{1}{2} \log_\sigma n - 1]$; and (2)
replace(T, i, c), which replaces T[i] with a character $c \in \mathcal{A}$.

We use blocking of length $\ell = \frac{1}{2} \log_\sigma n$ of T. Let T'[1..n'] be a string of length
$n' = n/\ell$ on an alphabet $\mathcal{A}^\ell$ made by blocking of T. The alphabet size is $\sigma^\ell = \sqrt{n}$.
Each character T'[i] corresponds to the string $T[((i-1)\ell + 1)..i\ell]$. A super-block
consists of $1/\varepsilon$ consecutive blocks in T' ($\ell/\varepsilon$ consecutive characters in T), where
$\varepsilon$ is a predefined constant.



Our algorithm runs in phases. Let $n'' = \varepsilon n'$. For every $j \ge 1$, we refer to
the sequence of the $(n''(j-1) + 1)$-th to $(n''j)$-th replacements as phase j. The
preprocessing stage corresponds to phase 0. Let $T^{(j)}$ denote the string just before
phase j. (Hence, $T^{(1)}$ is the input string T.) Let $F^{(j)}$ denote the frequency table of
blocks $b \in \mathcal{A}^\ell$ in $T^{(j)}$, and $C^{(j)}$ and $D^{(j)}$ a code table and a decode table defined
below. The algorithm also uses a bit-vector $R^{(j-1)}[1..n'']$, where $R^{(j-1)}[i] = 1$
means that the i-th super-block in T is encoded by code table $C^{(j-1)}$; otherwise,
it is encoded by code table $C^{(j-2)}$.

During the execution of the algorithm, we maintain the following invariant:

• At the beginning of phase j, the string $T^{(j)}$ is encoded with code table $C^{(j-2)}$
(we assume $C^{(-1)} = C^{(0)} = C^{(1)}$), and the table $F^{(j)}$ stores the frequencies
of blocks in $T^{(j)}$.

• During phase j, the i-th super-block is encoded with code table $C^{(j-2)}$ if
$R^{(j-1)}[i] = 0$, or $C^{(j-1)}$ if $R^{(j-1)}[i] = 1$. The code tables $C^{(j-2)}$ and $C^{(j-1)}$
do not change.

• During phase j, $F^{(j)}$ stores the correct frequency of blocks of the current T.

5.1 Phase 0: Preprocessing 

First, for each block $b \in \mathcal{A}^\ell$, we count the number of its occurrences in T' and
store it in an array $F^{(0)}[b]$. Then we sort the blocks $b \in \mathcal{A}^\ell$ in decreasing order
of the frequencies and assign a code $C^{(0)}[b]$ to encode them. The code
for a block b is defined as follows. If the length of the code enc(b), defined in
Section [2.2], is at most $\frac{1}{2} \log n$ bits, then $C^{(0)}[b]$ consists of a bit '0', followed by
enc(b). Otherwise, it consists of a bit '1', followed by the binary encoding of b,
that is, the block is stored without compression. The code length for any block b
is upper bounded by $1 + \frac{1}{2} \log n$ bits. Then we construct a table $D^{(0)}$ for decoding
a block. The table has $2^{1 + \frac{1}{2} \log n} = O(\sqrt{n})$ entries and $D^{(0)}[x] = b$ for all binary
patterns x of length $1 + \frac{1}{2} \log n$ such that a prefix of x is equal to $C^{(0)}[b]$. Note
that this decode table is similar to $r^{-1}$ defined in Section [2.2].

Next, for each block T'[i] ($i = 1, \ldots, n'$), compute its length using $C^{(0)}$ and
allocate space for storing it using the data structure of Theorem [6] with parameters
$(1 + \ell \log \sigma, \frac{n}{\ell}) = (1 + \frac{1}{2} \log n, \frac{2 n \log \sigma}{\log n})$, and $w = \log n$. From Lemma [2] and
Theorem [6] it follows that the size of the initial data structure is $nH_k(T) +
O\left(\frac{n}{\log_\sigma n}(k \log \sigma + \log\log n)\right)$ bits. Finally, for later use, copy the contents of
$F^{(0)}$ to $F^{(1)}$, and initialize $R^{(0)}$ by 0. By sorting the blocks by a radix sort, the
preprocessing time becomes $O(n \log \sigma / \log n)$.
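The construction of the code and decode tables with the escape bit can be sketched as follows. This is a simplified illustration under stated assumptions: `max_enc_len` plays the role of $\frac{1}{2}\log n$, `raw_bits` is a hypothetical caller-supplied helper returning the uncompressed binary encoding of a block, only blocks that actually occur are ranked, and the decode table is keyed by the exact code word (whose length is known from the memory manager) rather than by all padded patterns of fixed length.

```python
from collections import Counter

def enc(j):
    """j-th string (j >= 1) of '', '0', '1', '00', '01', ..."""
    return bin(j)[3:]

def build_tables(blocks, max_enc_len, raw_bits):
    """Return (code, decode) for the given list of blocks.
    Frequent blocks get '0' + enc(rank); blocks whose enc would be longer
    than max_enc_len are stored uncompressed as '1' + raw_bits(block)."""
    ranked = [b for b, _ in Counter(blocks).most_common()]
    code = {}
    for j, b in enumerate(ranked, start=1):
        e = enc(j)
        code[b] = "0" + e if len(e) <= max_enc_len else "1" + raw_bits(b)
    decode = {c: b for b, c in code.items()}   # keyed by the exact code word
    return code, decode
```

The leading bit makes the two kinds of code words distinguishable, so every code has length at most one plus the larger of `max_enc_len` and the raw block length.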

5.2 Algorithm for access 

The algorithm for access(T, i) is: Given the index i, compute the block number
$x = \lfloor (i-1)/\ell \rfloor + 1$ and the super-block number y containing T[i]. Obtain the
pointer to the block and the length of the code by address(x). Decode the block
using the decode table $D^{(j-2)}$ if $R^{(j-1)}[y] = 0$, or $D^{(j-1)}$ if $R^{(j-1)}[y] = 1$. This
takes constant time.
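The index arithmetic of access is small enough to spell out. In this sketch the parameter `blocks_per_super` stands for $1/\varepsilon$ and is assumed to be an integer; the function name is illustrative.

```python
def block_and_superblock(i, ell, blocks_per_super):
    """1-based block number x and super-block number y of text position i.
    blocks_per_super plays the role of 1/epsilon (assumed to be an integer)."""
    x = (i - 1) // ell + 1
    y = (x - 1) // blocks_per_super + 1
    return x, y
```

With blocks of 4 characters and 2 blocks per super-block, positions 1 to 8 fall into super-block 1 and position 9 is the first position of super-block 2.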



5.3 Algorithm for replace 

We first explain a naive, inefficient algorithm. If $b = T'[i]$ is replaced with $b'$,
we change the frequency table $F^{(j)}$ so that $F^{(j)}[b]$ is decremented by one and
$F^{(j)}[b']$ is incremented by one. Then a new code table $C^{(j)}$ and decode table $D^{(j)}$
are computed from the updated $F^{(j)}$, and all blocks $T'[j]$ ($j = 1, \ldots, n'$) are re-encoded
by using the new code table. Obviously, this algorithm is too slow.

To get a faster algorithm, we can delay updating code tables for the blocks
and rewriting the blocks using new code tables because of Theorem [4]. Because
the amount of change in entropy is small after a small change in the string, we
can show that the redundancy of using code tables defined according to an old
string is negligible. For each single character change in T, we re-encode a
super-block ($\ell/\varepsilon$ characters in T). After $\varepsilon n'$ changes, the whole string will be
re-encoded. To specify which super-block is to be re-encoded, we use an integer
array $G^{(j-1)}[1..n'']$. It stores a permutation of $(1, \ldots, n'')$ and indicates that at
the x-th replace operation in phase j we rewrite the $G^{(j-1)}[x]$-th super-block.
The bit $R^{(j-1)}[x]$ indicates if the super-block has been already rewritten or not.
The array $G^{(j-1)}$ is defined by sorting super-blocks in increasing order of lengths
of codes for encoding super-blocks.

We implement replace(T, i, c) as follows. In the x-th update in phase j,

1. If $R^{(j-1)}[G^{(j-1)}[x]] = 0$, i.e., if the $G^{(j-1)}[x]$-th super-block is encoded with
$C^{(j-2)}$, decode it and re-encode it with $C^{(j-1)}$, and set $R^{(j-1)}[G^{(j-1)}[x]] = 1$.

2. Let y be the super-block number containing T[i], that is, $y = \lfloor \varepsilon(i-1)/\ell \rfloor + 1$.

3. Decode the y-th super-block, which is encoded with $C^{(j-2)}$ or $C^{(j-1)}$ depending
on $R^{(j-1)}[y]$. Let S' denote the block containing T[i]. Make a new
block S from S' by applying the replace operation.

4. Decrement the frequency $F^{(j)}[S']$ and increment the frequency $F^{(j)}[S]$.

5. Compute the code for encoding S using $C^{(j-1)}$ if the y-th super-block is
already re-encoded ($R^{(j-1)}[y] = 1$), or $C^{(j-2)}$ otherwise ($R^{(j-1)}[y] = 0$).

6. Compute the lengths of the blocks in the y-th super-block and apply realloc
for those blocks.

7. Rewrite the blocks in the y-th super-block.

8. Construct a part of tables $C^{(j)}$, $D^{(j)}$, $G^{(j)}$, and $R^{(j)}$ (see below).
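The lazy re-encoding in the steps above can be prototyped with a toy model. The sketch below is an illustration under several simplifying assumptions and is not the data structure analyzed here: codes are kept as Python strings in a list instead of in the memory manager of Theorem [6], blocks missing from a code table are escaped as '1' followed by the raw block, the next code table is built in one go at the end of a phase rather than incrementally during it, `ell` is assumed to divide the text length, and access returns a single character. All names (`ToyCRAM`, `make_table`, and so on) are hypothetical.

```python
from collections import Counter

def make_table(freq):
    """Return (code, inverse): the block of rank j gets '0' followed by the
    j-th string of '', '0', '1', '00', ..."""
    ranked = [b for b, _ in freq.most_common()]
    code = {b: "0" + bin(j)[3:] for j, b in enumerate(ranked, start=1)}
    return code, {c: b for b, c in code.items()}

def encode(table, b):
    # Blocks missing from the table are escaped as '1' + raw block.
    return table[0].get(b, "1" + b)

def decode(table, code):
    return code[1:] if code[0] == "1" else table[1][code]

class ToyCRAM:
    def __init__(self, text, ell, per_super):
        self.ell, self.per = ell, per_super
        blocks = [text[i:i + ell] for i in range(0, len(text), ell)]
        self.freq = Counter(blocks)                       # frequency table F
        self.old = self.new = make_table(self.freq)       # C^(j-2) and C^(j-1)
        self.codes = [encode(self.old, b) for b in blocks]
        self.n_super = -(-len(blocks) // per_super)       # ceiling division
        self._start_phase()

    def _span(self, y):
        return range(y * self.per, min((y + 1) * self.per, len(self.codes)))

    def _start_phase(self):
        self.R = [False] * self.n_super                   # False: encoded with self.old
        length = lambda y: sum(len(self.codes[p]) for p in self._span(y))
        self.G = sorted(range(self.n_super), key=length)  # shortest super-block first
        self.x = 0                                        # replacements so far in this phase

    def _table(self, y):
        return self.new if self.R[y] else self.old

    def _decode_super(self, y):
        return [decode(self._table(y), self.codes[p]) for p in self._span(y)]

    def _write_super(self, y, blocks):
        for p, b in zip(self._span(y), blocks):
            self.codes[p] = encode(self._table(y), b)

    def access(self, i):
        """Return the single character T[i] (1-based)."""
        p = (i - 1) // self.ell
        return decode(self._table(p // self.per), self.codes[p])[(i - 1) % self.ell]

    def replace(self, i, c):
        g = self.G[self.x]                                # step 1: scheduled super-block
        if not self.R[g]:
            blocks = self._decode_super(g)                # decoded with the old table
            self.R[g] = True
            self._write_super(g, blocks)                  # re-encoded with the new table
        p = (i - 1) // self.ell                           # step 2: block p, super-block y
        y, off = p // self.per, (i - 1) % self.ell
        blocks = self._decode_super(y)                    # step 3
        old_b = blocks[p - y * self.per]
        new_b = old_b[:off] + c + old_b[off + 1:]
        blocks[p - y * self.per] = new_b
        self.freq[old_b] -= 1                             # step 4
        self.freq[new_b] += 1
        self._write_super(y, blocks)                      # steps 5 to 7
        self.x += 1
        if self.x == self.n_super:                        # end of phase: rotate tables
            self.old, self.new = self.new, make_table(+self.freq)
            self._start_phase()
```

Because each replacement also re-encodes one scheduled super-block, every super-block has been rewritten with the newer table by the end of a phase, which is what allows the older table to be discarded at the phase boundary.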

To prove that the algorithm above maintains the invariant, we need only to
prove that the tables $C^{(j-2)}$, $F^{(j)}$, and $C^{(j-1)}$ are ready at the beginning of phase
j. In phase j, we create $C^{(j)}$ based on $F^{(j)}$. This is done by just radix-sorting the
frequencies of blocks, and therefore the total time complexity is $O(\sigma^\ell) = O(\sqrt{n})$.
Because phase j consists of $n''$ replace operations, the work for creating $C^{(j)}$
can be distributed in the phase. We represent the array $G^{(j)}$ implicitly by
$(1/\varepsilon)(1 + \frac{1}{2} \log n)$ doubly-linked lists $L_d$; $L_d$ stores super-blocks of length d. By
retrieving the lists in decreasing order of d we can enumerate the elements of $G^{(j)}$.
If all the elements of a list have been retrieved, we move to the next non-empty
list. This can be done in $O(1/\varepsilon)$ time if we use a bit-vector of $(1/\varepsilon)(1 + \frac{1}{2} \log n)$
bits indicating which lists are non-empty. We copy $F^{(j)}$ to $F^{(j+1)}$ in constant
time by changing pointers to $F^{(j)}$ and $F^{(j+1)}$. For each replace in phase j, we
re-encode a super-block, which consists of $1/\varepsilon$ blocks. This takes $O(1/\varepsilon)$ time.
Therefore the time complexity for replace is $O(1/\varepsilon)$ time.

Note that during phase j, only the tables $F^{(j)}$, $C^{(j-2)}$, $C^{(j-1)}$, $C^{(j)}$,
$D^{(j-2)}$, $D^{(j-1)}$, $D^{(j)}$, $G^{(j-1)}$, $G^{(j)}$, $R^{(j-1)}$, and $R^{(j)}$ are stored. The other tables
are discarded.

5.4 Space analysis 

Let s(T) denote the size of the encoding of T by our dynamic data structure. At
the beginning of phase j, the string $T^{(j)}$ is encoded with code table $C^{(j-2)}$, which
is based on the string $T^{(j-2)}$. Let $L^{(j)} = nH_k(T^{(j)})$ and $L^{(j-2)} = nH_k(T^{(j-2)})$.

After the preprocessing, $s(T^{(1)}) \le L^{(1)} + O\left(\frac{n}{\log_\sigma n}(k \log \sigma + \log\log n)\right)$. If
we do not re-encode the string, for each replace operation we write at most
$1 + \frac{1}{2} \log n$ bits. Therefore $s(T^{(j)}) \le s(T^{(j-2)}) + O(n'' \log n)$ holds. Because $T^{(j)}$
is made by $2(n'' + \sqrt{n})$ character changes to $T^{(j-2)}$, from Theorem [4] we have
$|L^{(j)} - L^{(j-2)}| = O(n''(k+1)(\log n + \log \sigma))$. Therefore we obtain $s(T^{(j)}) \le L^{(j)} +
O(\varepsilon(k+1) n \log \sigma)$. The space for storing the tables $F^{(j)}$, $C^{(j)}$, $D^{(j)}$, $G^{(j)}$, $H^{(j)}$,
and $R^{(j)}$ is $O(\sqrt{n} \log n)$, $O(\sqrt{n} \log n)$, $O(\sqrt{n} \log n)$, $O(n'' \log n) = O(\varepsilon n \log \sigma)$,
$O(n'' \log n)$, $O(n'')$ bits, respectively.
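The identity $O(n'' \log n) = O(\varepsilon n \log \sigma)$ used for the table sizes follows from the definitions $n'' = \varepsilon n'$, $n' = n/\ell$ and $\ell = \frac{1}{2}\log_\sigma n$; the short calculation below spells this out as a worked step.

```latex
\[
n'' \log n \;=\; \varepsilon\, n' \log n
\;=\; \varepsilon \cdot \frac{n}{\ell} \cdot \log n
\;=\; \varepsilon \cdot \frac{2 n \log \sigma}{\log n} \cdot \log n
\;=\; 2\, \varepsilon\, n \log \sigma .
\]
```

The same substitution shows that the $O(n''(k+1)(\log n + \log\sigma))$ entropy drift between two phases is $O(\varepsilon(k+1) n \log \sigma)$ bits, since $\log \sigma \le \log n$.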

Next we analyze the space redundancy caused by the re-encoding of super-blocks.
We re-encode the super-blocks with the new code table in increasing order
of their lengths, that is, the shortest one is re-encoded first. This guarantees that
at any time the space does not exceed $\max\{s(T^{(j)}), s(T^{(j-1)})\}$. This completes
the proof of Theorem [1].

6 Concluding remarks 

We have presented a data structure called Compressed Random Access Memory
(CRAM), which compresses a string T of length n into its k-th order
empirical entropy in such a way that any consecutive $\log_\sigma n$ bits can be obtained
in constant time (the access operation), and replacing a character (the
replace operation) takes $O(\min\{\log_\sigma n, (k+1) \log n / \log\log n\})$ time. The time
for replace can be reduced to constant ($O(1/\varepsilon)$) time by allowing an additional
$O(\varepsilon(k+1) n \log \sigma)$ bits of redundancy. The extended CRAM data structure also supports
the insert and delete operations, at the cost of increasing the time for
access to $O(\log n / \log\log n)$ time, which is optimal under this stronger requirement,
and the time for each update operation also becomes $O(\log n / \log\log n)$.

An open problem is how to improve the running time of replace for the
CRAM data structure to O(1) without using the $O(\varepsilon(k+1) n \log \sigma)$ extra bits.
Appendix



A Memory management 

This appendix proves Theorems [5] and [6] from Section [4].

We first describe some technical details. (Recall the definitions from Section [4].)
Let $p = O(\log(mb))$. This is the number of bits needed to represent
a memory address relative to the head of a memory region storing B. For any
($\le b$)-block B[i] in B, we will refer to the current contents of B[i] by data(i). To
store data(i) for all ($\le b$)-blocks compactly while allowing efficient updates, we
allocate memory in such a way that data(i) for any given ($\le b$)-block B[i] may
be spread out over at most two non-consecutive regions in the memory. For this
purpose, define a segment to be $b + 4p$ consecutive bits of memory. The data
structure will always allocate and deallocate memory in terms of segments, and
keep a pointer to the start of the segment currently at the highest address in the
memory.

The core of our data structure is b doubly-linked lists $L_1, \ldots, L_b$, where each
list $L_x$ is a doubly-linked list of a set of segments, each of length $b + 4p$ bits. The
main idea is to use the segments in list $L_x$ to store all ($\le b$)-blocks of length x.
Moreover, we do so in such a way that when the length of a ($\le b$)-block changes
from x to $x + d$, we only need to update a few segments in the two lists $L_x$
and $L_{x+d}$.

Every segment belonging to a list $L_x$ is used as follows:

• pred: p bits to store the memory address of its predecessor segment in $L_x$.

• succ: p bits to store the memory address of its successor segment in $L_x$.

• block-data: $b + p$ bits for storing information associated with one or more
(≤ b)-blocks of length x. More precisely, for $i \in [m]$, let id(i) denote the
binary encoding of the integer i in $\log m < p$ bits. For every $x \in [b]$, form a
string $B_x$ by concatenating the pairs (id(i), data(i)) for all (≤ b)-blocks B[i]
of length x in some arbitrary order. Divide $B_x$ into substrings of length b
and store them in the block-data region of consecutive segments in $L_x$. Only
the first segment in each list $L_x$ is allowed to have some unused bits in its
block-data region.

• offset: $\log b$ ($< p$) bits to store the relative starting position within the
block-data region of the first (≤ b)-block. Observe that the head offset bits
of the segment are used to store another (≤ b)-block, whose starting position
belongs to another segment, except in the first segment in $L_x$.

See Fig. [1] for an illustration. Note that a (≤ b)-block B[i] may stretch across
two segments in $L_x$, and that these two segments might not be located in a
consecutive region of the memory.
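The segment layout above can be sketched as follows. This is only an illustration under stated assumptions: `segment_layout` and `pack_block_data` are hypothetical names, p is taken to be exactly ⌈log2(mb)⌉, the offset field is given a budget of p bits, bit strings are modelled as Python strings of '0'/'1', and the length of each block-data chunk is left as a parameter (`region_bits`).

```python
import math

def segment_layout(b: int, m: int) -> dict:
    """Field widths (in bits) of one segment, following the list above.

    Assumption: p = ceil(log2(m * b)); the text only gives p = O(log(mb)).
    """
    p = math.ceil(math.log2(m * b))
    return {"pred": p, "succ": p, "block_data": b + p, "offset": p,
            "total": b + 4 * p}

def pack_block_data(pairs, id_bits: int, region_bits: int):
    """Concatenate (id, data) pairs of equal-length blocks and cut the
    result into chunks of region_bits bits.

    pairs: list of (i, bitstring). Unused bits (shown as '_') are placed
    at the front, so only the first chunk can be partially empty.
    """
    stream = "".join(format(i, f"0{id_bits}b") + data for i, data in pairs)
    pad = (-len(stream)) % region_bits
    stream = "_" * pad + stream
    return [stream[k:k + region_bits] for k in range(0, len(stream), region_bits)]

layout = segment_layout(b=8, m=4)
chunks = pack_block_data([(1, "101"), (2, "110"), (3, "011")],
                         id_bits=2, region_bits=8)
```

Here a block whose bits straddle a chunk boundary corresponds to a (≤ b)-block that stretches across two segments.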

In addition to the doubly-linked lists, we use four arrays Seg[1..m], Pos[1..m],
Len[1..m], and Ind[1..m] to remember the locations of the (≤ b)-blocks. To be
precise, for each $i \in [m]$, we have

• Seg[i] contains a pointer j to Ind[j].



[Figure 1: block-data region of one segment: (empty) | id | data | ... | id | data]

Fig. 1. Each doubly-linked list $L_x$ consists of segments which together store all (≤ b)-
blocks of length x in their block-data regions (shown here). The data for a single (≤ b)-
block may be split across two segments and the first segment in $L_x$ may contain some
unused bits.

• Pos[i] contains the relative location of B[i] from the starting position of the
segment.

• Len[i] stores the length of B[i].

• Ind[j] stores the memory address of the segment which stores blocks B[i]
with Seg[i] = j.

The array Ind is used for indirect addressing. Consider the case where we do
not use the array Ind and store memory addresses in Seg directly. When a
segment moves to another address, we have to rewrite a non-constant number
of entries Seg[i], namely those for every i such that the i-th block is stored in the
segment. On the other hand, if Ind is used, we need not rewrite all these Seg[i];
it is enough to rewrite the single entry Ind[j] (where j = Seg[i] for those i).
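A minimal sketch of this indirection, following the bulleted definitions above (Seg[i] holds a slot j, and Ind[j] holds the segment's address). The class and method names are hypothetical, and memory addresses are modelled as plain integers.

```python
class IndirectTable:
    """Toy model of the Seg/Ind indirection (hypothetical helper)."""

    def __init__(self):
        self.seg = {}   # block index i -> slot j
        self.ind = []   # slot j -> current memory address of a segment

    def new_segment(self, address: int) -> int:
        """Register a segment and return its slot j."""
        self.ind.append(address)
        return len(self.ind) - 1

    def place(self, i: int, j: int) -> None:
        """Record that block i lives in the segment of slot j."""
        self.seg[i] = j

    def segment_address(self, i: int) -> int:
        return self.ind[self.seg[i]]

    def move_segment(self, j: int, new_address: int) -> None:
        # A single write, no matter how many blocks the segment holds.
        self.ind[j] = new_address

table = IndirectTable()
j = table.new_segment(1000)
for i in range(5):
    table.place(i, j)
table.move_segment(j, 2000)
```

After `move_segment`, all five blocks resolve to the new address although no Seg entry was touched.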

Proof. (of Theorem [5]) To implement address(i) in $O(1)$ time, we simply return
three values Q = Seg[Ind[i]], Pos[i], and Len[i]. Observe that B[i] is stored at
position Pos[i] of the segment Q. Moreover, when Pos[i] + Len[i] > b + p, the
(≤ b)-block B[i] spans two segments (i.e., segments Q and Q.succ). In this
case, the first Pos[i] + Len[i] − b − p bits of B[i] are stored at the end of segment
Q while the remaining bits of B[i] are stored at the beginning of segment Q.succ.

To implement realloc(i, b') in $O(b/w)$ time, we first find the location q of
B[i] by address(i). Store data(i) in a temporary space tmp. Obtain the index j
of the first (≤ b)-block B[j] in $L_x$ from the block-data of the first segment in $L_x$.
Copy the first (≤ b)-block B[j] in $L_x$ to position q and update Seg[j] and Pos[j]
accordingly. If the first segment in $L_x$ becomes empty, move the segment with the
largest address to the empty location. Let s and t be the addresses of the segment
before and after the movement, respectively. We have to change the pointers to all
(≤ b)-blocks B[j'] stored in the moved segment. Though the moved segment may
contain $O(b)$ (≤ b)-blocks, this can be done in constant time as follows. For any
(≤ b)-block B[j'] in the segment, it holds that Ind[j'] = r for some r, and Seg[r] = s
before the movement. Therefore Seg[Ind[j']] = s holds for all the blocks. After
the movement, Seg[Ind[j']] = t must hold for all the blocks. This is achieved
in constant time by simply setting Seg[r] = t, and we need not change Ind[j'] for
each j'. Each value r = Ind[j'] is an integer in the range [0, m] and corresponds
to a segment. Precisely, if the data structure currently has r segments, and a
new segment is allocated for B[j], we set Ind[j] = r + 1 and Seg[r + 1] is set
to the address of the new segment. If the blocks in the (r + 1)-st segment have
been moved, we change Seg[r + 1].

Next, copy the x bits stored in tmp to immediately before the first (≤ b)-
block in the head of the list $L_{b'}$. In case the first segment of $L_{b'}$ does not
have enough space to store it, we allocate a new segment. Finally, update Seg[i],
Pos[i], Len[i], and Ind[i]. The time complexity is $O(b/w)$. This is necessary to
copy segments of $O(b + p)$ bits.
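The central trick in realloc, filling the hole left by a departing block with the block at the head of its list so that only the head can have free space, can be sketched at the level of block indices. This is a simplified model, not the bit-level structure: `LengthLists` is a hypothetical name, segments and bit offsets are ignored, and the end of each Python list plays the role of the head of $L_x$.

```python
class LengthLists:
    """Blocks grouped by length x; free space only ever appears at the head."""

    def __init__(self):
        self.lists = {}   # length x -> list of block indices (head = last item)
        self.where = {}   # block index i -> (x, position inside lists[x])

    def insert(self, i: int, x: int) -> None:
        """Put block i at the head of the list for length x."""
        lst = self.lists.setdefault(x, [])
        self.where[i] = (x, len(lst))
        lst.append(i)

    def remove(self, i: int) -> None:
        """Remove block i and fill its hole with the head block."""
        x, q = self.where.pop(i)
        lst = self.lists[x]
        j = lst.pop()            # block currently at the head
        if j != i:
            lst[q] = j           # copy the head block into position q
            self.where[j] = (x, q)

    def realloc(self, i: int, new_x: int) -> None:
        """Change the length of block i from its current value to new_x."""
        self.remove(i)
        self.insert(i, new_x)

ll = LengthLists()
for i in (1, 2, 3, 4):
    ll.insert(i, 3)
ll.realloc(2, 5)
```

Each realloc touches only the two lists involved, which mirrors the claim that a length change from x to x + d affects only a few segments.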

To analyze the space complexity, note that the total length of data(i) for
all $B[i] \in B$ is s. Each list $L_x$ has at most one segment which has an empty
slot, and therefore there exist at most b segments with an empty slot. Then the
number of segments needed to store s bits is at most $s/b + b$ and the space to store the
segments is $(b + 4p)(s/b + b) = s(1 + \frac{4p}{b}) + b(b + 4p)$. This is $s + O(b^2 + m\log m)$
because $p = \log(mb) = \Theta(\log m)$ and $s \le bm$. We also need $4mp = \Theta(m\log m)$
bits of space for storing Seg[1..m], Ind[1..m], Pos[1..m], and Len[1..m]. In total,
the space is $s + O(b^2 + m\log m)$ bits. This proves Theorem [5]. □
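For readers who want the intermediate steps, the bound on the segment space can be expanded as below. This is a worked restatement of the argument, using only $s \le bm$, $p = \Theta(\log m)$, and the elementary inequality $4pb \le b^2 + 4p^2$.

```latex
\begin{align*}
(b + 4p)\left(\frac{s}{b} + b\right)
  &= s + \frac{4ps}{b} + b^2 + 4pb \\
  &\le s + 4pm + b^2 + 4pb
     && \text{since } s \le bm \\
  &\le s + 4pm + 2b^2 + 4p^2
     && \text{since } 4pb \le b^2 + 4p^2 \\
  &= s + O(b^2 + m\log m)
     && \text{since } p = \Theta(\log m).
\end{align*}
```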

The data structure of Theorem [5] uses $\Theta(m\log m)$ bits of additional space,
which is too large if we want to store many short blocks. Because of this, we
give an alternative data structure in Theorem [S]. We say the data structure has
parameters (b, m).

Proof. (of Theorem [B]) We use Theorem [5] to store a set of (≤ b)-blocks B[1..m]
for $b = 1 + w$ and $m = w^c$ for some constant $c > 0$. We say the data structure has
parameters $(1 + w, w^c)$. For general $m > w^c$, we split B[1..m] into $m' = \lceil m/w^c \rceil$
sub-arrays $B'_1[1..w^c], B'_2[1..w^c], \ldots, B'_{m'}[1..w^c]$, and to store each sub-array $B'_i$
we use Theorem [5] with parameters $(b, m) = (1 + w, w^c)$. Then address(i) and
realloc(i, b) are done in constant time. However, a problem is that the memory
space needed to store the data structure for each sub-array will change, so we cannot
store it in a consecutive memory region. To overcome this, we use a two-level
data structure. Let $M = \lceil m/w^c \rceil$. The higher level consists of M data structures
of Theorem [5] with parameters $(1 + w, w^c)$. We call each one $D_i$ ($i = 1, \ldots, M$).
Each $D_i$ assumes a consecutive memory region, which cannot be provided directly.
Therefore we use a kind of virtual memory. The memory to store segments is divided into
pages. Each page uses a physically consecutive memory region, while the pages
are located in non-consecutive regions. Each data structure $D_i$ has pages, and
each page contains either w segments or no segments, depending on how many
bits are necessary to store the blocks. Precisely, if s segments are necessary, the
first $\lceil s/w \rceil$ pages have w segments each, and the rest have no segments. The
total number of pages for all $D_i$ ($i = 1, \ldots, M$) is $M w^{c-1} = \lceil m/w \rceil$.

All the pages of all the data structures $D_i$ ($i = 1, \ldots, M$) are managed
by a single data structure of Theorem [S] with parameters $(O(w^2), \lceil m/w \rceil)$. We call
the data structure D. The algorithm for address(i) becomes as follows. Let
$q = \lfloor (i-1)/w^c \rfloor + 1$ and $r = i - (q-1)w^c$. The i-th block is stored as the r-th block
of $D_q$. Therefore we compute address(r) in $D_q$, and obtain the logical (virtual)
address x of the segment containing the block. To convert it into the physical
(real) address y of the memory, we first compute the page number z for the
segment, then compute address(z) in D. It is straightforward to compute the
address for the block inside the page because the segments in the page are of
the same length. These operations are done in constant time.
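The two-level lookup can be sketched as follows. The names are hypothetical: `cap` stands for the number of blocks handled by each $D_q$, `seg_bits` for the common segment length, and `page_address(z)` is a stand-in for address(z) in D (assumed to already know which pages belong to which $D_q$).

```python
def locate(i: int, cap: int):
    """Map the global 1-based block index i to (q, r): the r-th block of D_q."""
    q = (i - 1) // cap + 1
    r = i - (q - 1) * cap
    return q, r

def virtual_to_physical(seg_index: int, w: int, seg_bits: int, page_address):
    """Translate a 0-based segment index inside D_q to a physical bit address.

    Each page holds w segments of equal length, so the position inside the
    page is a simple multiplication once the page itself has been found.
    """
    z, k = divmod(seg_index, w)           # page number, segment within page
    return page_address(z) + k * seg_bits

# Example: pages of w = 4 segments, 10 bits each, stored out of order.
pages = {0: 700, 1: 100}
physical = virtual_to_physical(5, w=4, seg_bits=10, page_address=pages.__getitem__)
```

Both steps are constant-time arithmetic plus one lookup in D, which matches the claim at the end of the paragraph.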

The function realloc(i, b) is implemented as follows. First we find the data
structure $D_q$ that contains the i-th block, and execute realloc in $D_q$. Note that
$D_q$ uses virtual memory which is managed by D. Therefore we have to convert
a logical address to a physical one for any memory access. During the execution
of realloc in $D_q$, we have to move a constant number of pages of $O(w^2)$ bits in
D. If we naively use the result of Theorem [S], it takes $O(w)$ time. It is easy to
obtain an amortized $O(1)$ time algorithm because each block is of $O(w)$ bits
and movement of pages in D occurs every $O(w)$ operations of realloc in $D_q$.
It is also easy to obtain a worst-case $O(1)$ time algorithm at the cost of $O(w^2)$
bits of redundant space for each $D_q$. We can spread the movement of pages over
the next $O(w)$ executions of realloc in $D_q$. Then a constant number of pages
with empty slots will exist, resulting in $O(w^2)$ bits of redundancy in space. This
redundancy sums up to $O(\frac{m}{w^c} \cdot w^2) = O(m/w^{c-2})$ bits for all $D_q$.

We analyze the space complexity. Each $D_q$ has $O(w^c \log w)$-bit auxiliary
data structures, and they sum up to $O(m\log w)$ bits. The data structure D uses
$s + O(w^4 + \frac{m\log m}{w})$ bits. Therefore the total space is $s + O(w^4 + m\log w)$ bits. □

B A data structure for maintaining the extended CRAM 

This section proves Theorem [2]. To support insert and delete efficiently, we
use variable-length super-blocks. Namely, let $\tau = \Theta(\log n/\log\log n)$, and recall
that the block length is $\ell = \frac{1}{2}\log_\sigma n$.

Each super-block consists of $\tau$ to $2\tau$ blocks ($\tau\ell$ to $2\tau\ell$ consecutive characters
in T). These super-blocks are stored using the data structure of Theorem [5]
with parameters $(2\log^2 n/\log\log n,\ n\log\log n/\log^2 n)$. To represent super-block
boundaries, we use a bit-vector B[1..n] such that B[i] = 1 means that T[i] is
the first character in a super-block. Therefore B has $\Theta(n\log\sigma\log\log n/\log^2 n)$
ones. This bit-vector is stored using the following data structure.

Lemma 4 (Lemma 17 [15]). We can maintain any bit-vector B[1..n] within
$nH_0(B) + O(n\log\log n/\log n)$ bits of space, while supporting the operations
rank, select, insert, and delete, all in time $O(\log n/\log\log n)$.

Because $nH_0(B) = O(n\log\log n/\log n)$, the bit-vector can be stored in
$O(n\log\log n/\log n)$ bits, and rank(B, i) can be computed in $O(\log n/\log\log n)$
time, where rank(B, i) is the number of ones in B[1..i]. By using this data
structure, we can compute the super-block number containing T[i] by rank(B, i).
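To make the role of rank concrete, the sketch below locates the super-block of a character with a plain prefix-sum array. This is only a static stand-in with a hypothetical helper name (`build_rank`); it is not the dynamic structure of Lemma 4 and does not support insert or delete.

```python
from itertools import accumulate

def build_rank(bits):
    """Return rank(i) = number of ones in bits[1..i] (1-based), via prefix sums."""
    prefix = [0] + list(accumulate(bits))
    return lambda i: prefix[i]

# B[i] = 1 marks the first character of a super-block:
# here super-blocks start at positions 1, 4 and 6.
B = [1, 0, 0, 1, 0, 1, 0, 0]
rank = build_rank(B)
superblock_of_T5 = rank(5)   # T[5] lies in the 2nd super-block
```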

The algorithm for insert and delete in the extended CRAM works as
follows. First, we rewrite the super-block in which the insert or delete occurs.
If it contains fewer than $\tau$ or more than $2\tau$ blocks, we merge two consecutive
super-blocks or split it into two to maintain the invariant that every super-block
consists of $\tau$ to $2\tau$ blocks. If the lengths of super-blocks change, we update
the bit-vector B accordingly. Because only a constant number of super-blocks
change, the time complexity is $\Theta(\log n/\log\log n)$.
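The split/merge rule can be sketched on plain Python lists. This is an illustrative model only: `rebalance` is a hypothetical name, a super-block is a list of blocks, and the re-split of an oversized merged super-block is an assumption about how the invariant is restored in that case.

```python
def rebalance(sbs, k, tau):
    """Restore tau <= len(super-block) <= 2*tau around index k after an update.

    sbs: list of super-blocks, each a list of blocks. Modified in place.
    """
    if len(sbs[k]) > 2 * tau:
        # Too large: split into two halves.
        half = len(sbs[k]) // 2
        sbs[k:k + 1] = [sbs[k][:half], sbs[k][half:]]
    elif len(sbs[k]) < tau and len(sbs) > 1:
        # Too small: merge with a neighbouring super-block.
        j = k - 1 if k > 0 else k
        merged = sbs[j] + sbs[j + 1]
        if len(merged) > 2 * tau:
            # The merged super-block may overflow; split it evenly again.
            half = len(merged) // 2
            sbs[j:j + 2] = [merged[:half], merged[half:]]
        else:
            sbs[j:j + 2] = [merged]
    return sbs

split_case = rebalance([[1, 2, 3], [4, 5, 6, 7, 8]], k=1, tau=2)
merge_case = rebalance([[1], [2, 3]], k=0, tau=2)
```

Only the super-block at index k and at most one neighbour change, so a constant number of entries of the boundary bit-vector need updating.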

However, in the extended CRAM, the time complexity of access(T, i)
increases from $O(1)$ to $\Theta(\log n/\log\log n)$: We first compute the super-block
number for T[i] by using rank(B, i) in $\Theta(\log n/\log\log n)$ time. Then, we scan all
the blocks in the super-block to find the location storing $T[i..i + \ell - 1]$.
Furthermore, for each replace(T, i, c) operation, we re-encode a super-block. This
means that we set $\varepsilon = \Theta(\log\log n/\log n)$ in Theorem [1]. The time complexity
becomes $O(\log n/\log\log n)$.

We also have to consider "the change of log n problem". The sizes of blocks
and super-blocks depend on log n, where n is the length of the string. If n changes a lot, log n
also changes, and we have to reconstruct the data structure for the new value
of log n. To avoid this, we use the same technique as Mäkinen and Navarro [T^.
We partition the string T into three parts $T_1, T_2, T_3$, and encode them using
$\log n - 1$, $\log n$, $\log n + 1$ as their "log n" values, respectively. We maintain the
following invariant: if n is zero or a power of two, $T_1$ and $T_3$ are empty
and $T_2$ is equal to T, and if n increases by one, the length of $T_3$ grows by
two and that of $T_1$ shrinks by one. To accomplish this, depending on where
an insert occurs, we move the rightmost one or two characters of $T_1$ to the
beginning of $T_2$, and the rightmost one or two characters of $T_2$ to the
beginning of $T_3$. A deletion is done similarly. We can guarantee that if the string
length is doubled, "log n" increases by one. For example, if n is a power of
two then all the characters belong to $T_2$. Then after n insertions, the length
becomes 2n and all the characters move to $T_3$, and $T_3$ now becomes the new
$T_2$. The strings $T_1$, $T_2$, and $T_3$ are stored using the data structure of
Theorem [5] with parameters
$(2(\log n - 1)^2/\log(\log n - 1),\ n\log(\log n - 1)/(\log n - 1)^2)$,
$(2(\log n)^2/\log\log n,\ n\log\log n/(\log n)^2)$, and
$(2(\log n + 1)^2/\log(\log n + 1),\ n\log(\log n + 1)/(\log n + 1)^2)$, respectively.
The asymptotic space and time complexities do not change.

C Experimental results 

To test the performance of the CRAM data structure in practice, we implemented 
a prototype named cram and compared it to a gzip-based alternative method 
named cramgz. For the implementations, we used the C programming language. 
In this appendix, we describe the details of the experiments and the outcome. 

1. cram is the data structure that we introduced in this paper, but a few changes
were made to make it easier to implement. (These simplifications may reduce
the performance of the method slightly; however, as shown below, it is still
excellent.) To be precise, the cram prototype works as described in the paper
but with the following modifications:

• The memory is partitioned into large blocks, and the size of each large
block is denoted by b. Each large block is further partitioned into middle
blocks of length m. Finally, each middle block stores blocks of length
$\ell = 2$, instead of $\frac{1}{2}\log_\sigma n$, which correspond to the "blocks" described in
the paper.

• Pointers to large blocks and middle blocks are stored. 

• Each block is encoded by a Huffman code. We assign codes to all char- 
acters in the alphabet because otherwise we cannot encode characters 
that are missing from the initial string but later appear due to replace 
operations. 

• An additional parameter u is employed as follows. If x bytes of memory
change occur, then $x \cdot u$ bytes are re-encoded with new codes.
Furthermore, if $x \cdot u > b$, a large block is re-encoded. (This means that the
worst-case time complexity of replace becomes amortized, so it is no
longer constant.)

2. Next, cramgz is an original data structure based on gzip:

• The memory is partitioned into large blocks of length b'.

• Each large block is compressed by gzip and stored using our dynamic
memory management algorithm. We employed the zlib library, with
compression level 1 (i.e., the fastest compression).
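The block-wise scheme behind cramgz can be sketched with Python's standard `zlib` module. This is an illustration, not the authors' C prototype: the class name `GzipBlocks` is hypothetical, compressed blocks are kept in an ordinary Python list rather than in the dynamic memory manager of Appendix A, and both operations assume the accessed range lies inside a single large block.

```python
import zlib

class GzipBlocks:
    """Store data as independently compressed large blocks of block_size bytes."""

    def __init__(self, data: bytes, block_size: int = 1024):
        self.block_size = block_size
        # Level 1 is zlib's fastest compression level.
        self.blocks = [zlib.compress(data[k:k + block_size], 1)
                       for k in range(0, len(data), block_size)]

    def access(self, pos: int, length: int) -> bytes:
        """Read length bytes at pos; the whole block is decoded first."""
        k, off = divmod(pos, self.block_size)
        return zlib.decompress(self.blocks[k])[off:off + length]

    def replace(self, pos: int, new: bytes) -> None:
        """Overwrite bytes at pos: decode, patch, then re-encode the block."""
        k, off = divmod(pos, self.block_size)
        raw = bytearray(zlib.decompress(self.blocks[k]))
        raw[off:off + len(new)] = new
        self.blocks[k] = zlib.compress(bytes(raw), 1)

data = bytes(range(256)) * 10          # 2560 bytes -> 3 large blocks
store = GzipBlocks(data)
store.replace(1030, b"ACGT")
```

Even a 4-byte read or write pays for decoding (and, for writes, re-encoding) a full large block, which is the cost the experiments below measure.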

For the experiments, we used the two text files English and DNA from the Pizza
& Chili corpus. The lengths of these texts are limited to 200MB, so $n = 200 \cdot 2^{20}$
and the alphabet size $\sigma = 256$.

Experiment 1: The first experiment only involved cram and was done to
determine a suitable value for the parameter u. We first built the CRAM data
structure for the English text using the parameters b = 1024, m = 64. Then, we
overwrote the DNA text onto the (compressed) English text one character at a
time, from left to right, by using replace operations. The 1-st order entropy of
the initial text was about 4 bits per character (bpc), and including the
auxiliary data structure, the size was about 4.67 bpc. On the other hand, since the
DNA text only consists of four distinct characters {a, c, g, t} and the text is
almost random, its 1-st order entropy is 2 bpc and about 2.67 bpc including the
auxiliary data structure.

Fig. [2] shows the outcome. As expected, when the text is modified, its 1-st
order entropy (shown in the curve 'entropy') changes. The other lines in the
diagram show the data structure's change in size for various values of the
parameter u. If u = 1, the code never changes because we write the whole DNA
text before the end of Phase 1 (see Section [5] for the explanation of "phase"). In
other words, all the characters in the DNA text are encoded by the optimal code
for the English text, and the resulting size is bad. If u > 1, the code is updated
while the text is modified, and the size converges to the entropy (plus the size
of the auxiliary data structure). We can see that as the value of u increases, the
size converges towards the entropy more quickly but the time needed to update
the data structure also increases. We select u = 4 as a good trade-off between
the convergence speed and the update time.

Fig. 2. Results for Experiment 1. The x-axis represents the ratio of overwritten text;
0 is the initial situation where the entire text is from the English text, 50 means that
the left half of the text has been changed to the DNA text while the right half is still
from the English text, and 100 is when the whole text has become the DNA text.

Experiment 2: The second experiment compared the access and replace
times for cram and cramgz. We set the parameters of the data structures so that
both have the same size. For cram, we set b = 1024, m = 64, u = 4. Then, the
English text has size 4.67 bpc. To achieve the same compression ratio (4.67 bpc)
for cramgz, we had to select b' = 1024.

We performed two types of experiments: one to evaluate access by
measuring the time needed to read the entire compressed English text, and one to
evaluate replace by measuring the time needed to overwrite the DNA text onto
the compressed English text. We combined series of consecutive operations into
single read/write unit operations, and tried various sizes of read/write units
smaller than the block size. Fig. [3] shows the size of each read/write unit and
the resulting time for access and replace. In cramgz, large blocks of length
b' = 1024 bytes are directly compressed by gzip, and reading any unit shorter
than 1024 bytes still requires decoding the whole large block. Therefore, access
is not very efficient when using units shorter than b' bytes. Similarly, writing a
short unit requires decoding a large block, rewriting a part of it, and then
encoding the whole large block again. On the other hand, in cram, the base is the
middle block (i.e., m = 64 bytes) and therefore it is more efficient than cramgz.
Fig. [3] shows that cram is faster than cramgz for all read unit sizes, and
for write unit sizes of 16 to 256 bytes.

Fig. 3. Results for Experiment 2. The x-axis shows the size of the unit for read/write,
and the y-axis shows the resulting time needed to read the whole text by access
operations or to overwrite the whole text by replace operations, respectively.



